The Suffix Sequoia
Abstract
Standard technologies for sequence searching do not use database indexes. They fall into two groups: exhaustive algorithms, such as the Smith-Waterman algorithm [11], and heuristic ones, such as BLAST [1, 2], FASTA [10], and BLAT [7]. Specialised tools for DNA matching also exist, such as SIM4 [3] and SSAHA [9]. Of these, only BLAT and SSAHA use indexing. BLAT can be used with proteins, but its sensitivity is lower than that of BLAST. To our knowledge, no known index supports full-sensitivity searching. We showed [6] that the suffix tree produces an indexing gain but does not deliver good performance. Myers and Durbin [8] index the query with regard to the similarity matrix, but not the text itself. Significant effort has gone into hardware parallelisation [5, 12, 13] and Field Programmable Gate Arrays [4, 14]. Such hardware solutions complement algorithmic ones and are orthogonal to the database approach, which uses indexing. We propose a new data structure that combines features of the suffix tree and of an array-based index, and we measure how well it performs in approximate matching. In [8], 4% of the dynamic programming (DP) matrix is computed and the entire text is scanned. In contrast, we do not require a full text scan: we exhaustively scan a virtual index while computing the DP matrix, and we access the index on disk only if significant matches are found. In tests with 400 million amino acids (AAs) and an index window size of 5, the upper bound on the size of the matrix computation is 1.6%, and in practice only part of the DP matrix is computed, owing to the sparsity of the cost matrix.
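As a point of reference for the DP matrix discussed in the abstract, the following is a minimal Python sketch of Smith-Waterman local alignment scoring. It is not the suffix sequoia and not the authors' implementation. The function name `smith_waterman_score`, the match and mismatch scores, and the linear gap penalty are illustrative assumptions; the abstract concerns arbitrary cost matrices. The sketch also counts the cells it fills, to show that the unindexed exhaustive algorithm computes one cell per pair of query and text symbols.

```python
def smith_waterman_score(query, text, match=2, mismatch=-1, gap=-2):
    """Return (best local alignment score, number of DP cells filled).

    Assumed scoring (not from the source): a fixed match/mismatch score
    and a linear gap penalty instead of a protein similarity matrix.
    """
    prev = [0] * (len(text) + 1)  # DP row for the previous query symbol
    best = 0
    cells = 0
    for i in range(1, len(query) + 1):
        curr = [0] * (len(text) + 1)
        for j in range(1, len(text) + 1):
            sub = match if query[i - 1] == text[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a new local alignment
                prev[j - 1] + sub,   # align query[i-1] with text[j-1]
                prev[j] + gap,       # gap in the text
                curr[j - 1] + gap,   # gap in the query
            )
            best = max(best, curr[j])
            cells += 1
        prev = curr
    return best, cells


if __name__ == "__main__":
    score, cells = smith_waterman_score("ACD", "ACD")
    print(score, cells)
```

For a query of length m and a text of length n, this sketch always fills m × n cells. The percentages quoted in the abstract describe how much of such a matrix is computed.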
Similar resources
Indexed Searching on Proteins Using a Suffix Sequoia
Approximate searching on protein sequence data under arbitrary cost models is not supported by database indexing technology. We present a new data structure, suffix sequoia, which reduces the time complexity of the dynamic programming (DP) matrix calculation required in approximate matching. The data structure is compact. It uses just over 4 Bytes per symbol indexed. We show that time complexit...
Compact Suffix Trees Resemble PATRICIA Tries: Limiting Distribution of the Depth
Suffix trees are the most frequently used data structures in algorithms on words. In this paper, we consider the depth of a compact suffix tree, also known as the PAT tree, under some simple probabilistic assumptions. For a biased memoryless source, we prove that the limiting distribution for the depth in a PAT tree is the same as the limiting distribution for the depth in a PATRICIA trie, even...
Capturing Petascale Application Characteristics with the Sequoia Toolkit
Characterization of the computation, communication, memory, and I/O demands of current scientific applications is crucial for identifying which technologies will enable petascale scientific computing. In this paper, we present the Sequoia Toolkit for characterizing HPC applications. The Sequoia Toolkit consists of the Sequoia trace capture library and the Sequoia Event Analysis Library, or SEAL...
Wood of Giant Sequoia: Properties and Unique Characteristics
Wood properties of giant sequoia (Sequoia gigantea [Lindl.] Decne.) were compared with those for other coniferous tree species. Wood properties such as specific gravity, various mechanical properties, extractive content, and decay resistance of young-growth giant sequoia are comparable to or more favorable than those of coast redwood (Sequoia sempervirens [D. Don] Endl.). It is recommended th...
Long-term Dynamics of Giant Sequoia Populations: Implications for Managing a Pioneer Species
My colleagues and I have analyzed the age structure of four populations of giant sequoia (Sequoiadendron giganteum [Lindl.] Buchholz). We have found the following: (1) The amount of successful reproduction in a grove cannot be judged by the sizes of its trees. (2) Sequoia populations almost certainly were near equilibrium or increasing before the arrival of European settlers. (3) In this centur...